Prepare Dataframe

Clustering

Data Preparation

To perform a cluster analysis in R, the data should generally be prepared as follows:

  1. Rows are observations (individuals) and columns are variables.

  2. Any missing value in the data must be removed or estimated.

  3. The data must be standardized (i.e., scaled) to make variables comparable.
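These three steps can be sketched in a few lines of base R; the data frame `df` below is a hypothetical toy example, not the actual data used in this post:

```r
# Toy data frame: rows are observations, columns are variables (step 1)
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, 20, 30, 40))

df <- na.omit(df)   # step 2: remove rows containing missing values
df <- scale(df)     # step 3: standardize (center and scale) each variable
```

After `scale()`, every column has mean 0 and standard deviation 1, which keeps variables on different scales from dominating the distance calculations.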

I group the data by medium & year to compute the mean value for each p_group:
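The grouping step can be sketched with dplyr and tidyr; the long-format columns `medium`, `year`, `p_group`, and `value` below are hypothetical stand-ins for the real data:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one sentiment value per medium/year/party
df <- data.frame(
  medium  = c("Bild", "Bild", "Bild", "Die Welt"),
  year    = c(2001, 2001, 2002, 2001),
  p_group = c("SPD", "SPD", "SPD", "FDP"),
  value   = c(-0.1, -0.3, -0.2, -0.05)
)

# Mean value per medium/year/party, reshaped so parties become columns
wide <- df %>%
  group_by(medium, year, p_group) %>%
  summarise(mean_value = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = p_group, values_from = mean_value)
```

The result is one row per medium/year combination with one column per p_group, matching the shape of the matrices printed below.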

Unweighted DF

##                    Bündnis 90/ Die Grüne     CDU/CSU         FDP
## BamS                         -0.08350076 -0.02256243 -0.03682351
## Bericht aus Berlin           -0.11080468 -0.11537941 -0.15316751
## Berlin direkt                -0.09875372 -0.08490027 -0.08889407
## Berliner                     -0.06484421 -0.08807705 -0.06177991
## Bild                         -0.10511309 -0.04832445 -0.04318226
## Die Welt                     -0.10129520 -0.04735683 -0.03924910
##                    Linke/PDS/WASG         SPD
## BamS                  -0.17131152 -0.08143064
## Bericht aus Berlin    -0.14847533 -0.12518081
## Berlin direkt         -0.15777266 -0.09758545
## Berliner              -0.08150096 -0.07650623
## Bild                  -0.16098418 -0.08496746
## Die Welt              -0.12364047 -0.09560428

Weighted DF

##                    Bündnis 90/ Die Grüne     CDU/CSU          FDP
## BamS                        -0.007828121 -0.01052960 -0.005050128
## Bericht aus Berlin          -0.010457562 -0.04776818 -0.026401083
## Berlin direkt               -0.009451222 -0.03705477 -0.012294661
## Berliner                    -0.012230058 -0.02979307 -0.005277301
## Bild                        -0.008668780 -0.02165700 -0.004579914
## Die Welt                    -0.013492593 -0.01908624 -0.003600620
##                    Linke/PDS/WASG         SPD
## BamS                 -0.003450739 -0.02298595
## Bericht aus Berlin   -0.011988974 -0.02985531
## Berlin direkt        -0.009969052 -0.02599203
## Berliner             -0.005272176 -0.02471322
## Bild                 -0.004835171 -0.02831759
## Die Welt             -0.005276756 -0.03148756
# Drop medium/year rows with missing party means before clustering
m.unweighted <- na.omit(m.unweighted)
m.weighted <- na.omit(m.weighted)

K-Means Clustering

K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

The Basic Idea

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

\[ W(C_k)=\sum_{x_i\in C_k}(x_i-\mu_k)^2 \] where:

  • \(x_i\) is a data point belonging to Cluster \(C_k\)
  • \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)

Each observation (\(x_i\)) is assigned to a cluster such that the sum-of-squares (SS) distance of the observation to its assigned cluster center (\(\mu_k\)) is minimized.
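For concreteness, \(W(C_k)\) for a single cluster can be computed directly in R; the points below are toy numbers for illustration only:

```r
# Points assigned to one cluster (toy data): 3 observations, 2 variables
Ck <- matrix(c(1, 2,
               2, 3,
               3, 4), ncol = 2, byrow = TRUE)

mu_k <- colMeans(Ck)                        # cluster centroid
W_Ck <- sum(rowSums(sweep(Ck, 2, mu_k)^2))  # sum of squared Euclidean distances
W_Ck
# 4
```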

The objective function to be minimized is the total within-cluster sum of squares:

\[ \text{tot.withinss} = \sum^K_{k=1}W(C_k)=\sum^K_{k=1}\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

K-means Algorithm

The k-means algorithm can be summarized as follows:

  1. Specify the number of clusters (K) to be created (by the analyst).

  2. Randomly select k objects from the data set as the initial cluster centers (means).

  3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.

  4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the \(k\)th cluster is a vector of length \(p\) containing the means of all variables for the observations in that cluster, where \(p\) is the number of variables.

  5. Iteratively minimize the total within-cluster sum of squares (equation above). That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
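The steps above can be sketched as a bare-bones Lloyd-style iteration; this is purely for illustration (random toy data, a fixed number of iterations, no convergence check), and in practice `stats::kmeans()` should be used instead:

```r
set.seed(123)
X <- matrix(rnorm(40), ncol = 2)    # toy data: 20 observations, 2 variables
k <- 3                              # step 1: number of clusters

centers <- X[sample(nrow(X), k), ]  # step 2: random initial centers

for (iter in 1:10) {
  # step 3: assign each observation to its nearest centroid
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  cluster <- max.col(-d)
  # step 4: recompute each centroid as the mean of its assigned points
  for (j in 1:k) {
    pts <- X[cluster == j, , drop = FALSE]
    if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
  }
}
```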

The output of kmeans is a list with several components. The most important are:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. \(totss-tot.withinss\).
  • size: The number of points in each cluster.
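A typical call looks like the following; it assumes the prepared `m.unweighted` matrix from above, and the seed and `nstart` value are my own choices (k-means is sensitive to the random initial centers, so `nstart` runs several starts and keeps the best):

```r
set.seed(123)
km.res <- kmeans(m.unweighted, centers = 3, nstart = 25)

km.res$size           # number of observations per cluster
km.res$centers        # matrix of cluster means
head(km.res$cluster)  # cluster assignment for each medium/year
```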

If we print the results we’ll see that our groupings resulted in three clusters of sizes 29, 49, and 287. We see the cluster centers (means) for the three groups across the five variables (Bündnis 90/ Die Grüne, CDU/CSU, FDP, Linke/PDS/WASG, SPD). We also get the cluster assignment for each observation (e.g., BamS was assigned to cluster 3 in 2001, Bericht aus Berlin was assigned to cluster 1 in 2005, etc.).

We can also view our results using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the most variance.
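The plotting call itself is a one-liner; it assumes the `km.res` object from a previous `kmeans()` call on `m.unweighted`:

```r
library(factoextra)

# PCA-based 2-D scatter plot of the cluster assignments
fviz_cluster(km.res, data = m.unweighted)
```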

Unweighted


Weighted